The Utility of Intersubjective Methods in Forecasting

SPUDM 2025

Jessica Helmer, Sophie Ma Zhu, Nikolay Petrov, Ezra Karger, Mark Himmelstein

Typical Testing vs. Forecasting

Unlike typical testing, where items have a known answer . . .

○ What’s the capital of Italy?

In forecasting, we do not know the answer at the time of testing

○ What will be the population of Lucca in 2050?

To measure forecasting ability, forecasts are usually scored against the outcome (ground truth)

○ But that means we have to wait for items to resolve

Intersubjective Measures

Using peer comparisons to score forecasts instead of outcomes

Why would this work?

Wisdom of the Crowds: when many predictions are aggregated, individual errors tend to cancel out

○ Making the aggregate prediction typically accurate and reliable

These properties may make the crowd aggregate a good substitute for the outcome, allowing forecasts to be scored in real time
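A tiny numeric sketch of the wisdom-of-crowds effect the slide relies on: individual errors around the truth partly cancel when predictions are averaged. The true value, error scale, and crowd size below are made-up illustrative numbers.

```python
# Toy illustration: the aggregate of many noisy predictions is typically much
# closer to the truth than the average individual prediction.
import numpy as np

rng = np.random.default_rng(0)
truth = 100.0
estimates = truth + rng.normal(0.0, 20.0, size=1_000)   # 1,000 noisy individual predictions

individual_error = np.abs(estimates - truth).mean()     # typical individual miss
crowd_error = abs(estimates.mean() - truth)             # miss of the aggregated prediction

print(f"Mean individual error: {individual_error:.1f}")
print(f"Error of the crowd mean: {crowd_error:.1f}")
```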

Simulation

N = 1,000 forecasters

○ Varying skill levels

K = 1,000 items

○ Specified item “noisiness”

Generated forecasts for each forecaster on each item

○ Skill defined how far off a forecaster’s forecast was expected to be
○ Forecasters were calibrated on the variance of the item
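A minimal Python sketch of one way this generative setup could look. The slide does not give the exact model, so the lognormal skill distribution, the normal error model, and the scaling of errors by skill × item noisiness are all assumptions.

```python
# Sketch of the forecast-generation step under the assumptions named above.
import numpy as np

rng = np.random.default_rng(2025)

N_FORECASTERS = 1_000
K_ITEMS = 1_000

# Hypothetical skill parameter: the expected size of a forecaster's error (smaller = better).
skill = rng.lognormal(mean=0.0, sigma=0.5, size=N_FORECASTERS)

# Item "noisiness" and the latent true outcome for each item.
sigma_item = np.full(K_ITEMS, 1.0)             # e.g., the sigma = 1 condition
truth = rng.normal(loc=0.0, scale=sigma_item)  # one outcome per item

# forecast[i, k] = truth_k + error, where the error scale is skill_i * sigma_k,
# so forecasters are "calibrated" to each item's variance.
noise = rng.standard_normal((N_FORECASTERS, K_ITEMS))
forecasts = truth[None, :] + noise * skill[:, None] * sigma_item[None, :]
```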

Simulation

Sampled combinations of:

○ N = [2, 4, 8, 16, 32, 64] forecasters
○ K = [2, 4, 8, 16, 32, 64] items
○ \(\sigma\) = [1, 2]

Scored each forecast with:

○ ground truth
○ intersubjective scoring (absolute distance from the aggregate forecast of a group of size N)

Which scoring measure captures forecasters’ skill better?
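Continuing the sketch above (and reusing its `forecasts`, `truth`, and `skill` arrays), one way to compare the two scoring rules on a sampled N × K sub-grid. The median as the aggregate, absolute distance as the score, and Pearson correlation are illustrative assumptions, not the study's exact choices.

```python
# For a sampled group of n forecasters and k items, correlate each scoring rule
# with the true skill parameter used to generate the forecasts.
import numpy as np

def compare_scoring(forecasts, truth, skill, n, k, rng):
    people = rng.choice(forecasts.shape[0], size=n, replace=False)
    items = rng.choice(forecasts.shape[1], size=k, replace=False)
    f = forecasts[np.ix_(people, items)]                 # n forecasters x k items

    gt_score = np.abs(f - truth[items]).mean(axis=1)     # ground-truth scoring
    aggregate = np.median(f, axis=0)                     # aggregate forecast of the sampled group
    proxy_score = np.abs(f - aggregate).mean(axis=1)     # intersubjective (proxy) scoring

    true_skill = skill[people]
    return (np.corrcoef(true_skill, gt_score)[0, 1],
            np.corrcoef(true_skill, proxy_score)[0, 1])

rng = np.random.default_rng(7)
for n in [2, 4, 8, 16, 32, 64]:
    r_gt, r_proxy = compare_scoring(forecasts, truth, skill, n=n, k=64, rng=rng)
    print(f"N={n:>2}  r(skill, ground truth)={r_gt:.2f}  r(skill, proxy)={r_proxy:.2f}")
```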

Simulation: \(\sigma = 1\)

The correlation between intersubjective scores and skill increases with N

For N ≥ 16, intersubjective scoring captures the original skill parameter better than ground-truth scoring

Simulation: \(\sigma = 2\)

Increased variance affects ground-truth scoring but not intersubjective scoring

Intersubjective scoring correlates more strongly with skill at lower N × K combinations than it did with \(\sigma = 1\)

Intersubjective Measures

Types of intersubjective measures

Proxy Scoring

○ Scoring by a forecast’s distance from the aggregate forecast

Metapredictions

○ Explicit predictions about crowd aggregates
○ What would the average person predict the population of Lucca will be in 2050?

Tested for their real-time scoring ability

Surprisingly successful at predicting forecasting accuracy

But have only been explored in the context of the Probability Elicitation Format (PEF)
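A toy numeric illustration of the difference between the two measures for a single item. The forecast values, the mean as the crowd aggregate, and the absolute-distance score are all made up for illustration.

```python
# Proxy score: how far my own forecast is from the crowd aggregate.
# Metaprediction score: how far my guess about the crowd is from the crowd aggregate.
import numpy as np

crowd_forecasts = np.array([88_000, 90_000, 95_000, 84_000, 91_000])  # hypothetical Lucca-2050 forecasts
crowd_aggregate = crowd_forecasts.mean()

my_forecast = 93_000          # my own answer to the question
my_metaprediction = 89_000    # my guess at what the average person will predict

proxy_score = abs(my_forecast - crowd_aggregate)
metaprediction_score = abs(my_metaprediction - crowd_aggregate)

print(f"Crowd aggregate: {crowd_aggregate:.0f}")
print(f"Proxy score: {proxy_score:.0f}")
print(f"Metaprediction score: {metaprediction_score:.0f}")
```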

Present Study

  1. Are intersubjective measures still good predictors of forecasting accuracy in the Quantile Elicitation Format (QEF), which offers superior reliability?
  2. Are proxy scores or metapredictions stronger predictors of forecasting accuracy?

Superforecasters

Previously identified skilled forecasters

An aggregate of these superforecasters may serve as a better reference criterion than an aggregate of the general crowd

Proxy scoring with a superforecaster crowd aggregate has been an effective method

○ But has only been implemented between-subjects

Present Study

  1. Are intersubjective measures still good predictors of forecasting accuracy in the Quantile Elicitation Format (QEF), which offers superior reliability?

  2. Are proxy scores or metapredictions stronger predictors of forecasting accuracy?

  3. Are superforecaster aggregates a better reference criterion than general crowd aggregates?

Methods

Final wave of a longitudinal forecasting study (N = 894)

Forecasts on six items in the QEF

Metapredictions after each item:

○ What do you think the average person would predict?
○ What do you think a superforecaster would predict?

Additional sample of N = 42 superforecasters

Scoring Methods

Scored each forecast with:

○ Ground truth: own forecast's distance from the actual outcome
○ Proxy scoring: own forecast's distance from the general-crowd or superforecaster aggregate (Proxy Crowd and Proxy Super)
○ Metaprediction accuracy: metaprediction's distance from the general-crowd or superforecaster aggregate (Metaprediction Crowd and Metaprediction Super)

Compared these scores to forecasting accuracy on a separate set of thirty items
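A rough sketch of how the five scores could be computed from the study data. The function and array names are hypothetical, the median aggregate and absolute-distance rule are assumptions, and each forecast is treated as a single point value even though the QEF actually elicits several quantiles per item.

```python
# Compute one value per forecaster per scoring rule, averaged over the scored items.
import numpy as np

def study_scores(forecasts, meta_crowd, meta_super,
                 crowd_forecasts, super_forecasts, outcomes):
    """forecasts / meta_* are (n_people, n_items); *_forecasts are the reference
    samples used to build the aggregates; outcomes is (n_items,)."""
    crowd_agg = np.median(crowd_forecasts, axis=0)   # aggregate of the general crowd
    super_agg = np.median(super_forecasts, axis=0)   # aggregate of the superforecaster sample

    per_item = {
        "ground_truth":         np.abs(forecasts - outcomes),
        "proxy_crowd":          np.abs(forecasts - crowd_agg),
        "proxy_super":          np.abs(forecasts - super_agg),
        "metaprediction_crowd": np.abs(meta_crowd - crowd_agg),
        "metaprediction_super": np.abs(meta_super - super_agg),
    }
    # Average over items to get one score per forecaster per rule.
    return {name: s.mean(axis=1) for name, s in per_item.items()}
```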

Aggregate Accuracy

Ground-truth score distributions

The superforecaster aggregate was more often the more accurate of the two aggregates

Correlations

Superforecaster-referenced metapredictions and proxy scores show the strongest correlations

Contributions to Forecasting Proficiency

How much variance in forecaster accuracy does each score explain?

Mixed model with random effects for item and person; conducted a dominance analysis

Score                    Contribution   Proportion
Proxy Super                  .13           .23
Metaprediction Super         .12           .22
Proxy Crowd                  .11           .20
Metaprediction Crowd         .10           .18
Ground Truth                 .09           .17
\(R_{forecaster}^2\)         .54          1.00
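A simplified sketch of the dominance-analysis logic: each predictor's contribution is its incremental R² averaged over all subsets of the other predictors. For brevity it uses plain OLS at the forecaster level rather than the mixed model with random effects for item and person, so it illustrates only the subset-averaging step.

```python
# Dominance analysis via averaged incremental R^2 over all predictor subsets.
from itertools import combinations
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - resid.var() / y.var()

def dominance(predictors, y):
    """predictors: dict of name -> 1-D array; y: criterion (e.g., accuracy on the 30 items)."""
    names = list(predictors)
    contrib = {}
    for name in names:
        others = [n for n in names if n != name]
        gains = []
        for r in range(len(others) + 1):
            for subset in combinations(others, r):
                cols = list(subset)
                r2_without = (r_squared(np.column_stack([predictors[c] for c in cols]), y)
                              if cols else 0.0)
                r2_with = r_squared(np.column_stack([predictors[c] for c in cols + [name]]), y)
                gains.append(r2_with - r2_without)
        contrib[name] = float(np.mean(gains))   # average incremental R^2 for this predictor
    return contrib
```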

Select Crowds

Proxy scores are an effective way of finding select crowds to aggregate
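A sketch of the select-crowd idea: keep only the forecasters whose forecasts sit closest to the full-crowd aggregate (best proxy scores) and aggregate just those. The median aggregate, absolute-distance proxy score, and the top-20% cutoff are illustrative choices, not the procedure used in the study.

```python
# Build a "select crowd" from the forecasters with the best (lowest) proxy scores.
import numpy as np

def select_crowd_aggregate(forecasts, top_fraction=0.2):
    """forecasts: (n_people, n_items). Returns the select-crowd aggregate and kept indices."""
    crowd_agg = np.median(forecasts, axis=0)                    # full-crowd aggregate per item
    proxy_score = np.abs(forecasts - crowd_agg).mean(axis=1)    # lower = closer to the crowd
    n_keep = max(1, int(top_fraction * forecasts.shape[0]))
    keep = np.argsort(proxy_score)[:n_keep]                     # forecasters with best proxy scores
    return np.median(forecasts[keep], axis=0), keep
```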

Discussion

Intersubjective measures remain an effective measure of forecasting ability

○ More reliable picture of the truth
○ Less influenced by unpredictability in items
○ Superforecaster aggregates particularly useful

Can reduce measurement error by modifying the scoring criterion

○ Rather than modifying the task

References

Atanasov, P., & Himmelstein, M. (2023). Talent spotting in crowd prediction. In M. Seifert (Ed.), Judgment in predictive analytics. Springer.

Atanasov, P., Rescober, P., Stone, E., Swift, S. A., Servan-Schreiber, E., Tetlock, P., Ungar, L., & Mellers, B. (2017). Distilling the wisdom of crowds: Prediction markets vs. prediction polls. Management Science, 63(3), 691–706. https://doi.org/10.1287/mnsc.2015.2374

Galton, F. (1907). Vox populi. Nature, 75(1949), 450–451. https://doi.org/10.1038/075450a0

Himmelstein, M., Budescu, D. V., & Ho, E. H. (2023). The wisdom of many in few: Finding individuals who are as wise as the crowd. Journal of Experimental Psychology: General, 152(5), 1223–1244. https://doi.org/10.1037/xge0001340

Himmelstein, M., Zhu, S. M., Petrov, N., Karger, E., Helmer, J., Livnat, S., Bennett, A., Hedley, P., & Tetlock, P. (2025). The forecasting proficiency test: A general use assessment of forecasting ability. OSF. https://doi.org/10.31234/osf.io/a7kdx

Karger, E., Monrad, J., Mellers, B., & Tetlock, P. (2021). Reciprocal scoring: A method for forecasting unanswerable questions. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.3954498

Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The art and science of prediction. Crown.

Wilkening, T., Martinie, M., & Howe, P. D. L. (2022). Hidden experts in the crowd: Using meta-predictions to leverage expertise in single-question prediction problems. Management Science, 68(1), 487–508. https://doi.org/10.1287/mnsc.2020.3919

Zhu, S. M., Budescu, D. V., Petrov, N., Karger, E., & Himmelstein, M. (2024). The psychometric properties of probability and quantile forecasts [Preprint].

Thank you!

Questions?

Contact: jhelmer3@gatech.edu